Abstract

The Genographic Project is an international effort aimed at charting human migratory history. The project is nonprofit and non- 
medical, and, through its Legacy Fund, supports locally led efforts to preserve indigenous and traditional cultures. Although the first 
phase of the project was focused on uniparentally inherited markers on the Y-chromosome and mitochondrial DNA (mtDNA), the 
current phase focuses on markers from across the entire genome to obtain a more complete understanding of human genetic 
variation. Although many commercial arrays exist for genome-wide single-nucleotide polymorphism (SNP) genotyping, they were 
designed for medical genetic studies and contain medically related markers that are inappropriate for global population genetic 
studies. GenoChip, the Genographic Project's new genotyping array, was designed to resolve these issues and enable higher res- 
olution research into outstanding questions in genetic anthropology. The GenoChip includes ancestry informative markers obtained 
for over 450 human populations, an ancient human (Saqqaq), and two archaic hominins (Neanderthal and Denisovan) and was 
designed to identify all known Y-chromosome and mtDNA haplogroups. The chip was carefully vetted to avoid inclusion of medically 
relevant markers. To demonstrate its capabilities, we compared the FST distributions of GenoChip SNPs to those of two commercial
arrays. Although all arrays yielded similarly shaped (inverse J) FST distributions, the GenoChip autosomal and X-chromosomal
distributions had the highest mean FST, attesting to its ability to discern subpopulations. The chip's performance is illustrated in a principal
component analysis for 14 worldwide populations. In summary, the GenoChip is a dedicated genotyping platform for genetic
anthropology. With an unprecedented number of approximately 12,000 Y-chromosomal and approximately 3,300 mtDNA SNPs
and over 130,000 autosomal and X-chromosomal SNPs without any known health, medical, or phenotypic relevance, the GenoChip
is a useful tool for genetic anthropology and population genetics.

Introduction

Apportionment of human genetic variation has long established
that all living humans are related via recent common
ancestors who lived in sub-Saharan Africa some 200,000
years ago (Cann et al. 1987). The world outside Africa
was settled over the past 50,000-100,000 years (Henn et al.
2010) when the descendants of our African forebears spread
out to populate other continents (Cavalli-Sforza 2007).



© The Author(s) 2013. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution.

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/), which permits
non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com

Elhaik et al. Genome Biol. Evol. 5(5):1021-1031. doi:10.1093/gbe/evt066 Advance Access publication May 9, 2013

This "Out-of-Africa" hypothesis, backed by archeological 
findings (Klein 2008) and genetic evidence (Stringer and 
Andrews 1988; Laval et al. 2010), describes a major dispersal 
of anatomically modern humans that completely replaced 
local archaic populations outside Africa, although a scenario 
involving Europeans and West Africans admixing with extinct 
hominins was also proposed (Plagnol and Wall 2006). 
Remarkably, recent studies proposed evidence for two such archaic
admixture (interbreeding) events, one with Neanderthals
in Europe and eastern Asia (Green et al. 2010) and the second
with Denisovans in Southeast Asia and Oceania (Reich et al.
2011), though the extent of the hybridization remains questionable
(Eriksson and Manica 2012). Overall, the recurrent
migrations, admixture, and interbreeding events shaped the 
autosomes of modern populations into mosaics of ancient and 
recent alleles harbored in haplotypes that vary in size but not 
in the building blocks themselves. These subtle differences in 
autosomal allele frequency between populations together 
with uniparental markers provide genetic data with the po- 
tential to obtain evidence of mixing and migration of human 
populations. 

The advent of microarray single-nucleotide polymorphism 
(SNP) technology that revolutionized human population ge- 
netics and broadened our understanding of genetic diversity 
largely skipped genetic anthropology for three main reasons: 
first, only a handful of the estimated 5,000-6,000 indigenous 
population groups (Burger and Strong 1990; Fardon 2012) 
were genotyped and studied, which may limit the phylogeo- 
graphic resolution of the findings. Second, the plethora of 
genetic markers obtained from different genotyping platforms 
has resurrected the "empty matrix" problem, whereby pop- 
ulations from different studies can barely be compared due 
to the low overlap of these platforms. Finally, genotyping 
costs remained prohibitively high and unjustified for genetic 
anthropology, as the commercial genotyping platforms, by and
large, do not accommodate ancestry informative markers
(AIMs). Furthermore, these arrays are enriched in trait- or disease-related
markers, which prompt a host of psychological,
social, legal, political, and ethical concerns from the individual 
to the population and global levels (Royal et al. 2010). 

The first phase of The Genographic Project focused on re- 
constructing human migratory paths through the analysis of 
uniparentally inherited markers on the Y-chromosome and 
mitochondrial DNA (mtDNA). The success of the project in 
both inferring details of human migratory history (e.g., 
Balanovsky et al. 2011; Schurr et al. 2012) and attracting 
over half a million public participants interested in tracing 
their genetic ancestry has prompted entrepreneurs to offer 
multiple self-test kits that provide information ranging from 
disease risk and life-style choices (e.g., diet) to genetic ancestry 
(Wolinsky 2006). Some of these solutions have been criticized 
for making deceptive health-related claims and providing 
limited and imprecise answers regarding ancestry (Royal 
et al. 2010). The concerns about ancestry reporting were
not unjustified, as these entrepreneurs adopted the commercial
genotyping platforms that were fraught with medically
informative markers, depleted of AIMs, and overall yielded
biased measures of genetic diversity (Albrechtsen et al. 2010).

Although uniparental arrays do not suffer from the afore- 
mentioned predicaments, they are limited in that they repre- 
sent only a smaller and more ancient portion of our history 
and ignore our remaining ancestors whose contribution to our 
genome was more recent and substantial. In contrast, assess- 
ment of the spatial and temporal patterns of genetic variation 
in the rest of the genome coupled with data obtained from 
other disciplines can provide more information of our ances- 
tors. However, autosomal-driven studies attempting to discern 
markers informative to genetic anthropology from those 
having medical relevance often met with legal or ethical 
obstacles and failed to attract participants who remained 
concerned about the sharing and potential exploitation of 
their medical information (Royal et al. 2010). These constraints
render all commercial genotyping arrays unsuitable for genetic
anthropology, including the Human Origins array (Lu et al.
2011) that contains coding and medically related markers.

To facilitate high-quality research in genetic anthropology 
without obtaining health, trait, or medical information, we 
resolved to develop a novel genotyping array, which we
call the GenoChip. Our goals were to 1) design a state-of-the-art
SNP array dedicated solely to genetic anthropology,
2) validate its accuracy, 3) evaluate its ability to discern populations
compared with alternative arrays, and 4) demonstrate
its performance on worldwide populations.

Materials and Methods 

Genotype Data Retrieval 

AIMs were obtained from 15 studies (Yang et al. 2005; Price
et al. 2007, 2008; Haider et al. 2008; Tian et al. 2008, 2009;
Florez et al. 2009; Kosoy et al. 2009; McEvoy et al. 2009,
2010; Nassir et al. 2009; Henn et al. 2011; Kidd et al. 2011).

Genotype data for thousands of samples from over 300 
worldwide populations were obtained from 15 public and 
private collections (Conrad et al. 2006; Reich et al. 2009; 
Silva-Zolezzi et al. 2009; Teo et al. 2009; Xing et al. 2009, 
2010; Altshuler et al. 2010; Behar et al. 2010; Hunter-Zinck 
et al. 2010; Rasmussen et al. 2010, 2011; Chaubey et al. 
2011; Hatin et al. 2011; Henn et al. 2011; Yunusbayev 
et al. 2012) and the FamilyTreeDNA collection. To study 
gene flow from apes, ancient hominins, and modern 
humans, we used the data set of 257,000 high-quality 
autosomal SNPs assembled by Reich et al. (2010). 

SNP Validation 

To cross-validate the GenoChip's autosomal genotypes, we 
genotyped 168 samples from 14 worldwide populations of 
the 1000 Genomes Project including Americans of African 
ancestry from Southwest United States (ASW), Americans of
Mexican ancestry from Los Angeles, CA (MEX), Utah residents
with Northern and Western European ancestry from UT (CEU),
England and Scotland British (GBR), Finnish from Finland (FIN),
Gujarati Indians from Houston, TX (GIH), Han Chinese from
Beijing, China (CHB), Iberians from Spain (IBS), Italians from
Tuscany, Italy (TSI), Japanese from Tokyo, Japan (JPT), Kinh
from Ho Chi Minh City, Vietnam (KHV), Luhya in Webuye,
Kenya (LWK), Peruvians from Lima, Peru (PEL), and Yoruba
in Ibadan, Nigeria (YRI). The concordance rate between
GenoChip and the 1000 Genomes Project genotypes was calculated
as the proportion of genotypes that were identical
between the two data sets.
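
The concordance calculation can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name, the dictionary layout (marker ID mapped to a two-letter genotype string), and the decision to skip sites with a missing call in either data set are hypothetical choices, not details given in the text.

```python
def concordance_rate(calls_a, calls_b):
    """Proportion of identical genotypes among sites called in both data sets.

    calls_a, calls_b: dict mapping a marker ID to a genotype string such as "AG",
    or to None for a missing call (hypothetical layout, assumed for illustration).
    """
    shared = [
        snp for snp in calls_a
        if calls_a[snp] is not None and calls_b.get(snp) is not None
    ]
    if not shared:
        raise ValueError("no genotypes were called in both data sets")
    # A genotype is an unordered pair of alleles, so "AG" and "GA" are identical.
    matches = sum(sorted(calls_a[snp]) == sorted(calls_b[snp]) for snp in shared)
    return matches / len(shared)


# Toy example: three comparable sites, two of which agree.
chip = {"rs1": "AG", "rs2": "CC", "rs3": "TT", "rs4": None}
reference = {"rs1": "GA", "rs2": "CC", "rs3": "CT", "rs4": "AA"}
print(concordance_rate(chip, reference))
```

In the toy example, rs4 is ignored because the chip call is missing, and rs3 is the only discordant site.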

Comparing Population Genetic Summary Statistics 
between Genotyping Arrays 

To compare the performance of the approximately
130,000 validated autosomal and X-chromosomal SNPs of the
GenoChip array to commercial arrays, we obtained the list
of SNPs for the Illumina Human660W-Quad BeadChip
(544,366 SNPs) from Illumina and the Affymetrix Axiom
Human Origins array (627,719 SNPs) available at
ftp://ftp.cephb.fr/hgdp_supp10/Harvard_HGDP-CEPH/all_snp.map.gz
(last accessed May 19, 2013). Because of the lack of overlap
between these genotyping arrays, we used subsets of data
calculated for HapMap III populations. Minor allele frequency
(MAF) and FST estimates for Africans, Europeans, and Asians
were obtained from the "continental" HapMap data set, as
described in Elhaik (2012). Briefly, genotype data of 602 unrelated
individuals from eight populations (YRI, LWK, Maasai in
Kinyawa, Kenya [MKK], CEU, TSI, CHB, Chinese from metro- 
politan Denver, Colorado [CHD], and JPT) were downloaded 
from the International HapMap Project web site (phase 3, 
second draft) (Altshuler et al. 2010), passed through rigorous 
filtering criteria, and finally merged into continental popula- 
tions (African [288], European [144], and Asian [170]). The 
final continental data set consisted of 3 million SNPs geno- 
typed in at least one population from each continent. 

The MAF and FST values of the continental data set for
autosomal (2,823,367) and X-chromosomal (86,449) SNPs
were compared with those obtained from GenoChip
(126,425 and 2,421 SNPs, respectively), Illumina
Human660W (541,104 and 12,916 SNPs, respectively), and
Affymetrix Axiom Human Origins Array (308,949 and 2,984
SNPs, respectively).

Because of the large number of FST values in each data set,
their length distributions are very noisy. We thus adopted a
simple smoothing approach in which FST values are sorted and
divided into 1,000 equally sized subsets. The distribution of the
mean FST value is then calculated using a histogram with 40
equally sized bins ranging from 0 to 1. To test whether two
such FST distributions obtained by different arrays are different,
we used the Kolmogorov-Smirnov goodness-of-fit test
and the false discovery rate correction for multiple tests
(Benjamini and Hochberg 1995). Because the differences
between the distributions were highly significant due to the
large sample sizes, we also calculated the effect size, first by
using the nonoverlapping percentage of the two distributions,
and then by using Hedges' g estimator of Cohen's d (Hedges
1981). If the area overlap is larger than 98% and Cohen's d is
smaller than 0.05, we consider the magnitude of the difference
between the two distributions to be too small to be
biologically meaningful.
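
The smoothing and effect-size steps described above can be sketched as follows. This is a minimal illustration, not the code used in the study: all function names are hypothetical, the overlap is taken as the shared area of the two normalized histograms, and Hedges' g is computed on the unsmoothed values, both of which are assumptions. The Kolmogorov-Smirnov test and the false discovery rate correction are not reproduced here.

```python
import math
from statistics import mean, variance


def smooth_values(values, n_subsets=1000):
    """Sort the values, cut them into n_subsets nearly equal chunks, return each chunk's mean."""
    ordered = sorted(values)
    n = len(ordered)
    if n < n_subsets:
        raise ValueError("need at least one value per subset")
    # Integer boundaries guarantee that every chunk holds at least one value.
    bounds = [i * n // n_subsets for i in range(n_subsets + 1)]
    return [mean(ordered[bounds[i]:bounds[i + 1]]) for i in range(n_subsets)]


def histogram(values, n_bins=40, lo=0.0, hi=1.0):
    """Relative frequencies over n_bins equal-width bins spanning [lo, hi]."""
    counts = [0] * n_bins
    width = (hi - lo) / n_bins
    for v in values:
        # Clamp so that v == hi, or a slightly out-of-range estimate, lands in an edge bin.
        idx = min(max(int((v - lo) / width), 0), n_bins - 1)
        counts[idx] += 1
    return [c / len(values) for c in counts]


def overlap_area(hist_a, hist_b):
    """Shared area of two normalized histograms (1.0 means identical)."""
    return sum(min(a, b) for a, b in zip(hist_a, hist_b))


def hedges_g(x, y):
    """Cohen's d with a pooled standard deviation, times Hedges' small-sample correction."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / (nx + ny - 2)
    d = (mean(x) - mean(y)) / math.sqrt(pooled_var)
    return d * (1 - 3 / (4 * (nx + ny) - 9))


def negligible_difference(x, y):
    """Decision rule from the text: area overlap > 98% and |d| < 0.05."""
    overlap = overlap_area(histogram(smooth_values(x)), histogram(smooth_values(y)))
    return overlap > 0.98 and abs(hedges_g(x, y)) < 0.05
```

Each input list needs at least 1,000 values for the default smoothing. Two identical lists give an overlap of 1 and an effect size of 0, so the rule reports a negligible difference.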

Principal components analysis (PCA) calculations were car- 
ried out using smartpca of the EIGENSOFT package (Patterson 
et al. 2006). Polygons were drawn manually around popula- 
tions clustered separately from one another. 

Results and Discussion 

Designing the GenoChip 
Choosing the Markers 

The GenoChip was designed as an Illumina iSelect HD custom
genotyping bead array that offers the ability to interrogate
almost any SNP. In designing the chip, we endeavored to
identify the fewest possible SNPs that offer an increased
power for ancestry inference in comparison to random markers
(Royal et al. 2010). SNPs that discern and identify populations
are termed AIMs and are considered invaluable tools in
population genetics and genetic anthropology. Half of our
AIMs were culled from the literature, and the remainder
were calculated using our novel AIMsFinder based on an
approach described by Elhaik (2013) and infocalc
(Rosenberg 2005) (supplementary text S1, Supplementary
Material online). These two methods were applied on global
panels comprising over 300 populations (supplementary table
S1, Supplementary Material online) assembled from public
and private data sets that were genotyped on a diversified
set of arrays ranging from 30,000 to more than a million SNPs
in size. Many of these populations are unique to our project
and have never before been studied or searched for AIMs. Because
AIMsFinder infers the minimal number of markers necessary
to discern two genetically distinct populations, it was applied
in a pairwise fashion over all the population data sets. In contrast,
infocalc, which ranks SNPs by their informativeness for ancestry,
was applied to whole population panels organized by
the source of the genotype data (supplementary table S1,
Supplementary Material online), where the top 1% of the
results was considered AIMs. Overall, we ascertained over
80,000 autosomal and X-chromosomal AIMs from over 450
worldwide populations (fig. 1).
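
The ranking step can be illustrated with a short sketch. The text does not spell out which statistic was ranked; the code assumes Rosenberg's informativeness for assignment (I_n), computed for biallelic SNPs from per-population allele frequencies. It is not the infocalc program itself, and the function names and input layout are hypothetical.

```python
import math


def xlogx(p):
    """p * ln(p), with the usual convention that 0 * ln(0) = 0."""
    return p * math.log(p) if p > 0 else 0.0


def informativeness(freqs):
    """I_n of one biallelic SNP; freqs[i] is the frequency of one allele in population i."""
    k = len(freqs)
    total = 0.0
    # Sum the contribution of each of the two alleles.
    for allele_freqs in (freqs, [1 - f for f in freqs]):
        p_mean = sum(allele_freqs) / k
        total += -xlogx(p_mean) + sum(xlogx(p) for p in allele_freqs) / k
    return total


def top_fraction(snp_freqs, fraction=0.01):
    """Return the IDs of the most informative SNPs, keeping the given fraction (at least one)."""
    ranked = sorted(snp_freqs, key=lambda s: informativeness(snp_freqs[s]), reverse=True)
    keep = max(1, math.ceil(len(ranked) * fraction))
    return ranked[:keep]


# Toy panel with two populations: rsA differs most between them, rsB not at all.
panel = {"rsA": [0.1, 0.9], "rsB": [0.5, 0.5], "rsC": [0.4, 0.6]}
print(top_fraction(panel))
```

Under this definition a SNP with equal frequencies in all populations scores 0, and a SNP fixed for different alleles in two populations scores ln 2.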

Fig. 1. Worldwide distribution of populations from which AIMs were obtained. AIMs from over 450 world populations were harvested from the
literature (green) and calculated based on genotyped data from public and private collections (red) including over 30 Jewish populations (blue).

To facilitate studies on the extent of gene flow from
Neanderthals and Denisovans to modern humans, we collected
from the literature SNPs and haplotypes from genomic regions
bearing evidence of interbreeding (Noonan et al. 2006; Green
et al. 2010; Yotova et al. 2011). In addition, we used a
modified version of IsoPlotter+ (Elhaik et al. 2010; Elhaik and
Graur 2013) to identify regions in which modern humans and
Neanderthals share the derived allele and chimpanzees and
Denisovans share the ancestral allele (supplementary text S1,
Supplementary Material online). Using the same approach, we
identified SNPs within regions enriched for the Denisovan
shared derived alleles with humans. Overall, we included
nearly 26,000 autosomal and X-chromosomal SNPs from potential
interbreeding hotspots with extinct hominins. To support
studies of more recent gene flow from ancient to modern
humans, we included approximately 10,400 high-confidence
Paleo-Eskimo Saqqaq SNPs (Rasmussen et al. 2010). In addition,
we included approximately 12,000 high-confidence
Aboriginal SNPs (Rasmussen et al. 2011). High-linkage disequilibrium
(LD) SNPs (r² > 0.4) were excluded in all populations,
by choosing a random SNP of the high-LD pair, except for
hunter gatherers such as the Hadza and Sandawe of
Tanzania (Tishkoff and Williams 2002) and Melanesian populations
(Conrad et al. 2006) that are used to infer interbreeding
with extinct hominins (Reich et al. 2010; Lachance et al.
2012).
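
The LD-based exclusion can be sketched as below. This is an illustration under assumptions: LD is approximated by the squared Pearson correlation of genotype dosages (0, 1, or 2 copies of an allele), every pair of SNPs is compared rather than only nearby ones, and the function names are hypothetical. The text does not describe how LD was actually computed.

```python
import random
from itertools import combinations


def r_squared(x, y):
    """Squared Pearson correlation between two genotype dosage vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a monomorphic SNP carries no LD information
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy * sxy / (sxx * syy)


def prune_high_ld(genotypes, threshold=0.4, rng=None):
    """For every pair of retained SNPs above the threshold, drop one member at random."""
    rng = rng or random.Random()
    kept = set(genotypes)
    for a, b in combinations(sorted(genotypes), 2):
        if a in kept and b in kept and r_squared(genotypes[a], genotypes[b]) > threshold:
            kept.discard(rng.choice((a, b)))
    return kept


# Toy data: rs1 and rs2 are perfectly correlated, rs3 is uncorrelated with both.
toy = {
    "rs1": [0, 1, 2, 0, 1, 2],
    "rs2": [0, 1, 2, 0, 1, 2],
    "rs3": [2, 0, 1, 1, 0, 2],
}
print(sorted(prune_high_ld(toy, rng=random.Random(1))))
```

Comparing all pairs is quadratic in the number of SNPs; a genome-scale run would normally restrict comparisons to SNPs within a physical window.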

To support potential imputation efforts, we supplemented
regions of low SNP density (<1 SNP over 100,000 bases) with
random common SNPs from HapMap III (1,000 SNPs with
MAF > 20%) and the 1000 Genomes Project (3,500 SNPs
with MAF > 10% in at least one continental population). To
prevent false positives, we included mostly SNPs observed in
both the HapMap III and 1000 Genomes Project data sets
(Altshuler et al. 2010; Durbin et al. 2010). We further eliminated
A/T and C/G SNPs to minimize strand misidentification.
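
A/T and C/G SNPs are strand-ambiguous because the complement of each allele pair is the same pair, so the two strands cannot be told apart from the alleles alone. A minimal filter could look like this; the function names and the input layout (marker ID mapped to a pair of alleles) are hypothetical.

```python
# Allele pairs that read the same on both strands.
AMBIGUOUS_PAIRS = {frozenset("AT"), frozenset("CG")}


def is_strand_ambiguous(allele1, allele2):
    """True for A/T and C/G SNPs, in either allele order and either letter case."""
    return frozenset((allele1.upper(), allele2.upper())) in AMBIGUOUS_PAIRS


def drop_strand_ambiguous(snps):
    """Keep only SNPs whose allele pair identifies the strand unambiguously."""
    return {
        rsid: alleles
        for rsid, alleles in snps.items()
        if not is_strand_ambiguous(*alleles)
    }


candidates = {"rs1": ("A", "G"), "rs2": ("T", "A"), "rs3": ("C", "G"), "rs4": ("c", "t")}
print(sorted(drop_strand_ambiguous(candidates)))
```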




Fig. 2. SNP density in the GenoChip. The average numbers of
GenoChip SNPs per 100,000 nucleotides across the genome are color
coded. Gaps in the assembly are shown in gray. Axis label: Position (Mb).



The resulting chip has a SNP density of at least 1/100 kilobases
over 92% of the assembled human genome (hg19) (fig. 2),
including regions uncharted by the HapMap (I-III) and HGDP
projects (Conrad et al. 2006; Altshuler et al. 2010). This high
density of the chip and the excess inclusion of AIMs make it
suitable for imputation, particularly for common markers
(Pasaniuc et al. 2012).
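
One way to quantify such a density figure is to tile a chromosome into fixed 100-kilobase windows and count the windows that contain at least one SNP. The sketch below assumes non-overlapping windows and 0-based positions; the text does not say how the 92% figure was computed, and the function name is hypothetical.

```python
def window_coverage(positions, chrom_length, window=100_000):
    """Fraction of fixed-size windows that contain at least one SNP position."""
    n_windows = -(-chrom_length // window)  # ceiling division
    occupied = {pos // window for pos in positions if 0 <= pos < chrom_length}
    return len(occupied) / n_windows


# Toy chromosome of 500 kb: SNPs fall in windows 0, 1, and 4, leaving two windows empty.
print(window_coverage([10, 150_000, 160_000, 450_000], chrom_length=500_000))
```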
Finally, we constructed over 45,000 probes to identify SNPs
defining all known Y-chromosome and mtDNA haplogroups, 
many of which were not reported in the literature (supple- 
mentary text S2, Supplementary Material online). 

Compatibility to Commercial Genotyping Arrays 

Looking at autosomal and X-chromosomal SNPs, the 
GenoChip is highly compatible with other commercial arrays. 
Some 76% of our SNPs overlap with those in the Illumina
Human 660W-Quad array, 55% overlap with the Illumina
HumanOmni1-Quad, Illumina Express, and Affymetrix 6.0
arrays, and 40% overlap with the Affymetrix 5.0 and
Affymetrix Human Origins arrays. With the exception of dedicated
Y chromosome and mtDNA chips, the GenoChip includes
the most comprehensive collection of uniparental
markers.

Vetting the Chip for Health, Trait or Medical Markers 

Several steps were taken to ensure that the genetic results 
would not be exploited for pharmaceutical, medical, and 
biotechnological purposes. First, participant samples were 
maintained in complete anonymity during GenoChip analysis. 
Second, no phenotypic or medical data were collected from 
the participants. Third, we included only SNPs in noncoding 
regions without any known functional association (Graur et al. 
2013), as reported in dbSNP build 132. Last, we filtered our 
SNP collection against a 1.5 million SNP data set (Pheno SNPs)
containing all variants that have potential, known, or sus- 
pected associations with diseases. 

To construct the Pheno SNPs data set, we extracted SNPs
from multiple open-access databases including the Online
Mendelian Inheritance in Man (OMIM) (http://www.ncbi.nlm.nih.gov/omim/,
last accessed May 19, 2013), the Cancer
Genome Atlas (Hudson et al. 2010), PhenCode (Giardine et al.
2007), the National Human Genome Research Institute
(NHGRI) Genome-Wide Association Studies (GWAS) Catalog
(Hindorff et al. 2009), The Genetic Association Database
(Becker et al. 2004), MutaGeneSys (Stoyanovich and Pe'er
2008), GWAS Central (Thorisson et al. 2009), and SNPedia,
as well as SNPs identified in the major histocompatibility complex
(MHC) region. We also excluded SNPs reported to be
associated with phenotypic traits. Finally, to circumvent imputation
efforts toward inferring potential medically relevant
SNPs, we excluded SNPs that were in high LD (r² > 0.8) with
the Pheno SNPs.
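
The two exclusion rules (membership in the Pheno SNPs set, and high LD with any member of it) amount to a simple set filter. The sketch below assumes that pairwise LD values are already available as a dictionary keyed by marker pairs; that layout and the function name are hypothetical.

```python
def vet_candidates(candidates, pheno_snps, ld_r2, threshold=0.8):
    """Drop candidates that are Pheno SNPs or are in high LD with a Pheno SNP.

    ld_r2 maps a (snp_a, snp_b) tuple to a precomputed r-squared value.
    """
    tagged = set()
    for (a, b), r2 in ld_r2.items():
        if r2 > threshold:
            # A marker that tags a Pheno SNP could be used to impute it.
            if a in pheno_snps:
                tagged.add(b)
            if b in pheno_snps:
                tagged.add(a)
    return [s for s in candidates if s not in pheno_snps and s not in tagged]


pheno = {"rs2", "rs9"}
ld = {("rs3", "rs9"): 0.95, ("rs1", "rs9"): 0.50, ("rs4", "rs1"): 0.99}
print(vet_candidates(["rs1", "rs2", "rs3", "rs4"], pheno, ld))
```

In the toy example rs2 is removed as a Pheno SNP and rs3 because it tags rs9; rs4 is kept because its high-LD partner, rs1, is not a Pheno SNP.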

We thus designed the first genotyping array dedicated 
for genetic anthropological and genealogical research that 
is suitable for detecting gene flow from archaic hominins 
and ancient humans into modern humans as well as between 
worldwide populations. The final GenoChip has over 130,000
highly informative autosomal and X-chromosomal markers,
approximately 12,000 Y-chromosomal markers, and approximately
3,300 mtDNA markers without any known health,
medical, or phenotypic relevance (supplementary table S2,
Supplementary Material online).

Validating the GenoChip Results 

The accuracy of the autosomal genotypes obtained by 
the GenoChip was assessed by genotyping 168 worldwide 
samples from the 1000 Genomes Project and cross-validating 
the results. The concordance rate per sample was over 99.5%. 
We did not observe any position with mismatching homozy- 
gote alleles. The marginal error rate was expected due to the 
low coverage of the 1000 Genomes Project data, particularly 
for rare alleles (Durbin et al. 2010). We thus confirmed that 
genotypes reported by the GenoChip are accurate. 

The ability of the GenoChip to infer uniparental hap- 
logroups was similarly assessed by genotyping 400 additional 
samples with known haplogroups. The haplotypes of these 
samples were confirmed by Sanger sequencing of the full 
mitochondrial genome and all relevant Y chromosome SNP 
locations that determined the exact haplogroup down to the 
last branch of the published Y-chromosomal tree (supplemen- 
tary text S2, Supplementary Material online). The average 
success rates for the paternal and maternal haplogroups 
were 82% and 90%, respectively (fig. 3). The reason for
our inability to validate the remaining haplogroups is the
unavailability of control samples to identify deeper splits in
the tree. Moreover, some haplogroups cannot be measured
with the Illumina bead chip technology because they are not
represented by a real SNP but rather by large-scale variations
of repetitive elements. We note that some of the failed
markers for particular haplogroups can be substituted by
phylogenetically equivalent markers to rescue these haplogroups,
although formally they were counted as missing.
Our experience with the tens of thousands of GenoChip
participants indicates that most samples (>99%) are classified
on haplogroup branches that are perfectly captured by the
GenoChip. The remaining users for which the exact position
along the tree cannot be assigned (e.g., R-P312*) are classified
to a higher level haplogroup (e.g., R-P310). A large-scale
genotyping effort to validate the remaining haplogroups is
under way. We thus confirmed that GenoChip produces
highly accurate results and has broad coverage for markers
defining Y-chromosome and mtDNA haplogroups.

Testing the GenoChip's Abilities to Discern Populations 
MAF Distribution

Before comparing the ability of the GenoChip SNPs to discern
populations, we compared the similarity of their MAF distribution
with those of the Illumina Human660W and Affymetrix
Human Origins SNP arrays. Because of the low overlap of
these three arrays, we obtained and analyzed genotype data 
from eight HapMap populations. The results of the complete 
set of HapMap markers were compared with three subsets of 
markers that overlapped with those of each array. 

Fig. 3. Success rate in identifying Y-chromosomal (left) and mtDNA (right) haplogroups. The plots depict all known basal haplogroups (columns), the
number of known subgroups in each haplogroup (top of each column), and the proportion of subgroups that were validated with the GenoChip.

Fig. 4. MAF distributions for autosomal (a) and X-chromosomal (b) HapMap SNPs. MAF distributions are shown for HapMap SNPs and two subsets
that overlap with the Illumina Human660W and GenoChip SNPs.



A comparison of the MAF distributions of the three 
arrays revealed gross differences in allele frequencies (fig. 4, 
supplementary fig. S1, Supplementary Material online). In the 
HapMap data set, over 82% of the SNPs are common 
(MAF > 0.05) and less than 5% are considered rare 
(MAF < 0.01). The proportion of common SNPs in all the 
arrays is similar (96-98%), but the GenoChip is enriched for 
the most common SNPs (MAF > 0.25). Because of the high 
frequency of the rare ENCODE SNPs in the HapMap data set, 
none of the arrays resembled the shape of the HapMap's MAF 
distribution. Nonetheless, both the Human660W (0.07%) and
Human Origins (0.36%) arrays are enriched in rare SNPs compared
with the GenoChip (0.008%). Similar trends were observed
for X-chromosomal SNPs. Here, the HapMap data set
consisted of 83% common SNPs, compared with 93% for the
GenoChip and 96% for the commercial arrays. The GenoChip
array exhibits similar enrichment in the most common SNPs
(MAF > 0.3), but unlike the commercial arrays, it also consists
of 1% extremely rare SNPs due to the inclusion of rare haplotypes
speculated to indicate interbreeding with archaic
hominins. Altogether, the MAF distributions of the three
arrays differ from the HapMap MAF distribution and
correspond to the choices of SNP ascertainment made in the
design of each array.
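
The MAF categories used in this comparison can be made concrete with a short sketch. The thresholds for common (MAF > 0.05) and rare (MAF < 0.01) come from the text; the label "low-frequency" for the range in between, the function names, and the allele-count input are assumptions made for illustration.

```python
from collections import Counter


def minor_allele_frequency(n_allele1, n_allele2):
    """MAF from the observed counts of the two alleles of a SNP."""
    total = n_allele1 + n_allele2
    if total == 0:
        raise ValueError("no alleles observed")
    return min(n_allele1, n_allele2) / total


def maf_class(maf):
    """Common and rare follow the thresholds in the text; the middle label is hypothetical."""
    if maf > 0.05:
        return "common"
    if maf < 0.01:
        return "rare"
    return "low-frequency"


def class_proportions(allele_counts):
    """Share of SNPs in each MAF class for a list of (count1, count2) pairs."""
    tally = Counter(maf_class(minor_allele_frequency(a, b)) for a, b in allele_counts)
    return {label: n / len(allele_counts) for label, n in tally.items()}


print(class_proportions([(90, 10), (60, 40), (199, 1), (97, 3)]))
```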

Genomewide FST Distribution

To assess the extent of genetic diversity that can be inferred
among human subpopulations by the different arrays, we
next compared their FST distributions (Wright 1951). FST measures
the differentiation of a subpopulation relative to the
total population and is directly related to the variance in
allele frequency between subpopulations, such that a high
FST corresponds to a larger difference between subpopulations
(Holsinger and Weir 2009). Elhaik (2012) used 1 million markers
that were genotyped in 602 HapMap samples from eight
populations to carry out a two-level hierarchical FST analysis.
He showed that the greatest proportion of genetic variation
occurred within individuals residing in the same populations,
with only a small amount (12%) of the total genetic variation
being distributed between continental populations and an even
smaller amount (1%) between intracontinental populations. An
FST distribution for three continental populations employing 3
million HapMap SNPs yielded an even lower estimate (8%) of
the proportion of genetic variation distributed between
continental populations due to the large number of rare alleles
(Elhaik 2012).
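
The link between FST and the variance in allele frequency can be shown with the textbook form of Wright's statistic for one biallelic locus, FST = Var(p) / (p̄(1 - p̄)), where p̄ is the mean allele frequency across subpopulations. The sketch below uses that form with equally weighted subpopulations; it is not the hierarchical estimator of Elhaik (2012), and the function name is hypothetical.

```python
def wright_fst(freqs):
    """Var(p) / (p_bar * (1 - p_bar)) for one biallelic locus, equal population weights."""
    k = len(freqs)
    p_bar = sum(freqs) / k
    if p_bar in (0.0, 1.0):
        return 0.0  # the locus is monomorphic overall, so there is nothing to partition
    var_p = sum((p - p_bar) ** 2 for p in freqs) / k
    return var_p / (p_bar * (1 - p_bar))


# Identical frequencies give no differentiation; fixed differences give the maximum.
print(wright_fst([0.3, 0.3]), wright_fst([0.4, 0.6]), wright_fst([0.0, 1.0]))
```

A locus with frequencies 0.4 and 0.6 in two subpopulations has a variance of 0.01 around a mean of 0.5, giving an FST of about 0.04.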

In a similar manner to Elhaik's (2012) later analysis, we used
the FST values calculated for eight HapMap populations
grouped into three continental populations to create three
subsets for the markers that overlap with each array.
Although all FST distributions were similar in shape to the
HapMap FST distribution, they differed in their means (fig. 5,
supplementary fig. S2, Supplementary Material online). The
autosomal and X-chromosomal SNPs of the commercial
arrays have significantly lower FST values (Kolmogorov-Smirnov
goodness-of-fit test, P < 0.05) than those of the
GenoChip due to the high fraction of rare uninformative
SNPs in these arrays. The magnitude of the differences between
the FST values of the GenoChip and those of the commercial
arrays was also large for autosomal (area overlap 86-91%,
Cohen's d 0.09-0.13) and X-chromosomal SNPs (area
overlap 93%, Cohen's d 0.09-0.11). These results suggest a
reduced ability of the commercial arrays to elucidate ancient
demographic processes (Kimura and Ota 1973; Watterson
and Guess 1977).

The Illumina Human660W array had the highest fraction of
low-FST alleles, suggesting it is the least suitable for population
genetic studies compared with the GenoChip and Human
Origins. As only half of the Human Origins SNPs could be
tested, it is difficult to evaluate its performance. However,
we speculate that the large number of rare alleles reflects
the private alleles of the dozen populations used for its ascertainment.
Because the MAF and FST were not used as filtering
criteria for the GenoChip SNPs, we can conclude that its enrichment
toward high-FST SNPs mirrors the success of the ascertainment
process and its potential for population genetic
studies.

Genetic Diversity in Worldwide Populations 

Last, PCA (Price et al. 2006) was used to explore the extent of 
population differentiation between 14 worldwide populations 
that were genotyped on the GenoChip in the validation stage 
(fig. 6A). The samples aligned along the two well-established
geographic axes of global genetic variation: PC1 (sub-Saharan 
Africa vs. the rest of the Old World) and PC2 (east vs. west 
Eurasia) (e.g., Li et al. 2008; Elhaik 2013). GenoChip results 
reveal geographically refined groupings of Eastern (Luhya) and 
Western (Yoruba) Africans, Eastern (Chinese and Japanese) 
and South Eastern (Vietnamese) Asians, Amerindian 
(Peruvians and Mexicans) and Indian populations, and finally



Fig. 5. Distribution of locus-specific FST in three continental populations. FST values were obtained for (a) autosomal and (b) X-chromosomal HapMap
SNPs. FST distributions are shown for HapMap SNPs and two subsets that overlap with the Illumina Human660W and GenoChip SNPs. The histograms show
bin distribution as indicated on the x axis and the cumulative distribution (line).




Fig. 6. PCA plots of genetic diversity across 14 worldwide populations. Each figure represents the genetic diversity seen across the populations
considered, with each sample mapped onto a spectrum of genetic variation represented by two axes of variation corresponding to two eigenvectors of the
PCA. Individuals from each population are represented by a unique color. (A) Analysis of all populations. The insets magnify European, Asian, and the cluster
of Amerindian and Indian individuals. (B) Analysis of East Asian individuals. (C) Analysis of European individuals. (D) Analysis of Amerindian and Indian
individuals. A polygon surrounding all or most of the individuals belonging to a group designation highlights the population groups.



Northern (Finnish), Southern (Italian and Iberians), and 
Western (British and CEU) Europeans. As expected, the 
Amerindian populations form a gradient along the diagonal 
line between European and East Asians based on their dom- 
inant ancestry as did the African Americans along the diagonal 
line between Africans and Europeans. These patterns are sim- 
ilar to those observed in worldwide populations using com- 
mercial arrays (e.g., Teo et al. 2009; Xing et al. 2010). 



When we consider only the East Asian populations 
(comprising CHB, JPT, and KHV), the first and second axes 
of variation completely separated the three populations 
(fig. 6B), in agreement with Teo et al. (2009). In a similar 
manner, we were able to differentiate Gujarati Indians and 
Americans of Mexican ancestry (fig. 6C), as well as Italians,
Iberians, and Western European populations (fig. 6D), with
the exception of one TSI outlier. As expected, some overlap
was observed between individuals of Northern and Western
European ancestry (CEU) and British (GBR).

Conclusions 

To summarize, we designed, developed, validated, and tested 
the GenoChip, the first genotyping chip completely dedicated 
to genetic anthropology. The GenoChip will help to clarify 
the genetic relationships between archaic hominins such as 
Neanderthal and Denisovan, extinct humans, and modern 
humans as well as to provide a more detailed understanding 
of human migratory history. We compared the MAF and FST
distributions of the GenoChip SNPs to those of HapMap and 
two commercially available arrays and demonstrated the 
ability of the GenoChip to differentiate subpopulations 
within global data sets. We expect that the expanded use of 
the GenoChip in genetic anthropology research will expand 
our knowledge of the history of our species. 

Supplementary Material 

Supplementary text S1 and S2, tables S1 and S2, and figures
S1-S4 are available at Genome Biology and Evolution
online (http://www.gbe.oxfordjournals.org/).